For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
The beauty industry is very competitive with these huge beauty retailer such Macy, Nordstrom and Sephora. However, Sephora need to focus more on the customer feedback of the brands in order to analyze the current trend in consumer purchase behaviour regarding the price of the product in Sephora. This sephora data are from 2020 that we will used for our analysis. Our objective is to determine which predictors impact the on the price of the products.
This dataset has 6108 rows and 11 variables.There are 6 categorical variables and 5 continious variables.
Alharbi, Raghad. (2020).Sephora website. Kaggle. https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website/data
VARIABLES TO PREDICT WITH:
VARIABLES WE WANT TO PREDICT:
SIZE RATING NUMBER_OF_REVIEW LOVE
Min. :0.0000 Min. :0.000 Min. : 0.0 Min. : 0
1st Qu.:0.0000 1st Qu.:4.000 1st Qu.: 11.0 1st Qu.: 1800
Median :1.0000 Median :4.000 Median : 48.0 Median : 5350
Mean :0.6721 Mean :3.999 Mean : 313.8 Mean : 19044
3rd Qu.:1.0000 3rd Qu.:4.500 3rd Qu.: 226.2 3rd Qu.: 16100
Max. :1.0000 Max. :5.000 Max. :19000.0 Max. :1300000
PRICE VALUE_PRICE ONLINE_ONLY EXCLUSIVE
Min. : 2.0 Min. : 2.00 Min. :0.0000 Min. :0.0000
1st Qu.: 24.0 1st Qu.: 24.00 1st Qu.:0.0000 1st Qu.:0.0000
Median : 34.0 Median : 35.00 Median :0.0000 Median :0.0000
Mean : 49.8 Mean : 51.39 Mean :0.1938 Mean :0.2924
3rd Qu.: 59.0 3rd Qu.: 60.00 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :549.0 Max. :549.00 Max. :1.0000 Max. :1.0000
LIMITED_EDITION LIMITED_TIMES_OFFER CAT.PRICE
Min. :0.0000 Min. :0.0000000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000000 1st Qu.:0.0000
Median :0.0000 Median :0.0000000 Median :0.0000
Mean :0.1013 Mean :0.0004912 Mean :0.3011
3rd Qu.:0.0000 3rd Qu.:0.0000000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000000 Max. :1.0000
The summary statistics table show that variables like RATING and PRICE have a broad range of values, indicating variety in product ratings and pricing. NUMBER_OF_REVIEW shows a significant difference between mean and median results, indicating a skewed distribution most likely caused by a few products with high review counts. Based on the quartile and maximum values of LIMITED_EDITION and EXCLUSIVE, there is a limited amount of variation in these categories.
The LIMITED_EDITION, SIZE, ONLINE_ONLY, EXCLUSIVE, LIMITED_TIMES_OFFER, CAT.PRICE columns has 1 and 0.we will convert the categorical variables to be factors. This will be easier to compare and analyze between two and more groups.
# A tibble: 2 × 2
LIMITED_EDITION n
<chr> <int>
1 0 5489
2 1 619
According to the graph above, we can see that we have the price less than 50 coded as (0) more than the price greater than 50. Based on the facet graph, we can see that the predictors related to the CAT.PRICE are NUMBER_OF_REVIEW and LOVE variables. However, they both are strong negative correlation that is statically significant.And the VALUE_PRICE are the strong positive correlation to CAT.PRICE since its value are almost the same as price.
According to the graph, we can see that most price are mostly around in between the value of $0- $50. Based on the facet graph, we can see the potential predictors related to PRICE are the CAT.PRICE. Most than that, their correlation related to PRICE are significant weak but SIZE,NUMBER_OF_REVIEW and LOVE are statistically significant.
We want to determine whether the price of a product impacts how much people love it. As observed in the graph, as the price increases, the number of people who love the product decreases. This indicates that customers at Sephora mostly purchase products within the price range of $0 to $100. However, none of the variables strongly correlate with PRICE, as shown by the facet graph analysis. Despite this, NUMBER_OF_REVIEW and LOVE show a weak negative correlation with PRICE, but they are both statistically significant.
For this analysis we will use a Linear Regression Model.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| VALUE_PRICE | 0.946 | 0.002 | 465.645 | 0.000 |
| LIMITED_EDITION1 | -8.150 | 0.334 | -24.394 | 0.000 |
| EXCLUSIVE1 | -1.715 | 0.224 | -7.673 | 0.000 |
| ONLINE_ONLY1 | -0.961 | 0.253 | -3.794 | 0.000 |
| (Intercept) | 1.508 | 0.457 | 3.298 | 0.001 |
| SIZE1 | 0.506 | 0.211 | 2.401 | 0.016 |
| RATING | 0.226 | 0.100 | 2.268 | 0.023 |
| LOVE | 0.000 | 0.000 | -0.857 | 0.392 |
| LIMITED_TIMES_OFFER1 | 1.659 | 4.387 | 0.378 | 0.705 |
| NUMBER_OF_REVIEW | 0.000 | 0.000 | 0.050 | 0.960 |
After reviewing the regression model, we have identified several predictors that do not significantly impact the prediction of PRICE. Therefore, we will proceed by refining the model to exclude these non-significant predictors.
For this analysis, we will make the finalized version of the Linear Regression Model. We have excluded LOVE, NUMBER_OF_REVIEW, and LIMITED_TIMES_OFFER1 from the model.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| VALUE_PRICE | 0.946 | 0.002 | 467.488 | 0.000 |
| LIMITED_EDITION1 | -8.133 | 0.334 | -24.374 | 0.000 |
| EXCLUSIVE1 | -1.724 | 0.223 | -7.724 | 0.000 |
| ONLINE_ONLY1 | -0.934 | 0.252 | -3.708 | 0.000 |
| (Intercept) | 1.476 | 0.456 | 3.235 | 0.001 |
| SIZE1 | 0.522 | 0.210 | 2.483 | 0.013 |
| RATING | 0.216 | 0.099 | 2.178 | 0.029 |
After analyzing the new finalize model, we can see that there is a concern with our data. In the Residuals vs. Fitted Plot, we can see a clear pattern in this plot that looks like a curve suggests non-linearity which indicate that the relationship between predictor and the PRICE is not linear. Futhermore, in the Q-Q plot,there is curved pattern that is not lie on the straigt line. This indicates that there is non-normality in the residual that can impact the validity hypothesis tests and those outlier could also impact the model. Therefore, we could improve by managing the outlier that might impact the distribution of residual.
From the following table, we can see the effect on Median Value by the predictor variables.
| Variable | Direction |
|---|---|
| SIZE1 | Increase |
| RATING | Increase |
| VALUE_PRICE | Increase |
| ONLINE_ONLY | Decrease |
| EXCLUSIVE | Decrease |
| LIMITED_EDITION | Decrease |
This nominal logistic model was created all the predictors to determine whether CAT.PRICE is above or below 0.5.The predictors EXCLUSIVE, LIMITED_EDITION, and VALUE_PRICE—all had p-values below the alpha level of 0.1. The model graph indicates that all variables have p-values less than 0.05.
The Training Error Rate is (38+58) / (2520+38+58+1049) = 0.026 = 2.6%
The Training Sensitivity = 1049/1049+58=0.947 = 94.7%
The Validation Error Rate is (28+31)/(1638+28+31+701) = 0.024 = 2.4%
The Validation Sensitivity = 701/701+31= 0.957 = 95.7%
Both training and validation error rate have a low error rate which indicates that the model will perform well with a few mistakes as well as the sensitivity for both training and validation, they have the highest sensitivity rate that indicates the great performance of the model.
In conclusion, our analysis of Sephora website shows a negative correlation between price and consumer affection, suggesting higher-priced items attract less love, highlighting Sephora’s price-sensitive customer base. And the regression model with 97% of adjusted R-squared shows that the effective of predictors in price determination. Nominal logistic regression also shows that whereas features can increase pricing, they do not always turn into more customer love.
---
title: "Sephora Website Data Analysis Project"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret)
library(broom)
```
```{r load_data}
df <- read_csv("Updated Sephora_website_dataset.csv")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=250}
-----------------------------------------------------------------------
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* CONCLUSION
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
The beauty industry is very competitive with these huge beauty retailer such Macy, Nordstrom and Sephora. However, Sephora need to focus more on the customer feedback of the brands in order to analyze the current trend in consumer purchase behaviour regarding the price of the product in Sephora. This sephora data are from 2020 that we will used for our analysis. Our objective is to determine which predictors impact the on the price of the products.
#### The Data
This dataset has 6108 rows and 11 variables.There are 6 categorical variables and 5 continious variables.
#### Data Sources
Alharbi, Raghad. (2020).Sephora website. Kaggle. https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website/data
### The Data
VARIABLES TO PREDICT WITH:
* *SIZE*: if the product has a size or not
* *RATING*: The rating of the product
* *NUMBER_OF_REVIEW*: The number of reviews of the product
* *ONLINE_ONLY*: If the product is sold online only
* *EXCLUSIVE*: If the product is sold exclusively on Sephora's website
* *LIMITED_TIMES_OFFER*: If the product has a limited time offer
* *LIMITED_EDITION*: If the product is limited edition
* *LOVE*: The number of people loving the product
* *VALUE_PRICE*: The value price of the product (for discounted products)
VARIABLES WE WANT TO PREDICT:
* *PRICE*:The price of the product
* *CAT.PRICE*: Price > 50 is 1, lower is 0
Data
=======================================================================
Column {data-width=250}
-----------------------------------------------------------------------
### Organize the Data
```{r, cache=TRUE}
colnames(df) <- make.names(colnames(df))
summary(df)
```
The summary statistics table show that variables like RATING and PRICE have a broad range of values, indicating variety in product ratings and pricing. NUMBER_OF_REVIEW shows a significant difference between mean and median results, indicating a skewed distribution most likely caused by a few products with high review counts. Based on the quartile and maximum values of LIMITED_EDITION and EXCLUSIVE, there is a limited amount of variation in these categories.
Column {data-width=250}
-----------------------------------------------------------------------
### Transform Variables
The LIMITED_EDITION, SIZE, ONLINE_ONLY, EXCLUSIVE, LIMITED_TIMES_OFFER, CAT.PRICE columns has 1 and 0.we will convert the categorical variables to be factors. This will be easier to compare and analyze between two and more groups.
```{r, cache=TRUE}
df <- mutate(df,LIMITED_EDITION=as.factor(LIMITED_EDITION),
SIZE =as.factor(SIZE),
ONLINE_ONLY = as.factor(ONLINE_ONLY),
EXCLUSIVE = as.factor(EXCLUSIVE),
LIMITED_TIMES_OFFER=as.factor(LIMITED_TIMES_OFFER),
CAT.PRICE =as.factor(CAT.PRICE))
```
#### LIMITED_EDITION
```{r, cache=TRUE}
as_tibble (select(df,LIMITED_EDITION) %>%
table())
```
#### CAT.PRICE (high or low price)

Data Viz #1
=======================================================================
Column {data-width=700}
-----------------------------------------------------------------------
### Response Variables
#### CAT.PRICE above 50 (1)/below (0)
```{r, cache=TRUE}
as_tibble(select(df,CAT.PRICE) %>%
table()) %>%
ggplot(aes(y=n,x= CAT.PRICE)) + geom_bar(stat="identity")
```
According to the graph above, we can see that we have the price less than 50 coded as (0) more than the price greater than 50. Based on the facet graph, we can see that the predictors related to the CAT.PRICE are NUMBER_OF_REVIEW and LOVE variables. However, they both are strong negative correlation that is statically significant.And the VALUE_PRICE are the strong positive correlation to CAT.PRICE since its value are almost the same as price.
Column {data-width=700}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df,CAT.PRICE,SIZE,RATING,NUMBER_OF_REVIEW,LOVE,VALUE_PRICE))
```
Data Viz #2
=======================================================================
Column {data-width=700}
-----------------------------------------------------------------------
### Response Variables
#### PRICE
```{r, cache=TRUE}
ggplot(df, aes(x = PRICE)) +
geom_density(fill = "darkgrey", alpha = 0.5) +
labs(title = "Density Plot of Prices", x = "Price", y = "Density")+
scale_x_continuous(breaks = c(0, 50, 100, 200,300,400,500))
```
According to the graph, we can see that most price are mostly around in between the value of $0- $50. Based on the facet graph, we can see the potential predictors related to PRICE are the CAT.PRICE. Most than that, their correlation related to PRICE are significant weak but SIZE,NUMBER_OF_REVIEW and LOVE are statistically significant.
Column {data-width=700}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df,PRICE,CAT.PRICE,SIZE,RATING,NUMBER_OF_REVIEW,LOVE,LIMITED_EDITION))
```
Data Viz #3
=======================================================================
Column {data-width=700}
-----------------------------------------------------------------------
### Response Variables
#### PRICE
```{r, cache=TRUE}
ggplot(df, aes(x = PRICE, y = LOVE)) +
geom_point(alpha = 0.6, color = "blue", size = 2) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Impact of Price on Love for the Product",
x = "Price", y = "Love") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(face = "bold"),
axis.text = element_text(color = "gray20"))
```
We want to determine whether the price of a product impacts how much people love it. As observed in the graph, as the price increases, the number of people who love the product decreases. This indicates that customers at Sephora mostly purchase products within the price range of $0 to $100. However, none of the variables strongly correlate with PRICE, as shown by the facet graph analysis. Despite this, NUMBER_OF_REVIEW and LOVE show a weak negative correlation with PRICE, but they are both statistically significant.
Column {data-width=700}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df,PRICE,RATING,NUMBER_OF_REVIEW,LOVE))
```
PRICE Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict PRICE
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
lm <- lm(PRICE ~ . -CAT.PRICE, data = df)
summary(lm)
```
```{r,include=FALSE, cache=TRUE}
tidy(lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
summary(lm)$coef
```
```{r, cache=TRUE}
idx <- order(coef(summary(lm))[,4])
out <- coef(summary(lm))[idx,]
knitr::kable(out, digits = 3)
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(lm, which=c(1,2))
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After reviewing the regression model, we have identified several predictors that do not significantly impact the prediction of PRICE. Therefore, we will proceed by refining the model to exclude these non-significant predictors.
Row
-----------------------------------------------------------------------
### Predict PRICE Final Version
For this analysis, we will make the finalized version of the Linear Regression Model. We have excluded LOVE, NUMBER_OF_REVIEW, and LIMITED_TIMES_OFFER1 from the model.
```{r, include=FALSE, cache=TRUE}
new_lm <- lm(PRICE ~ . -CAT.PRICE -LOVE - NUMBER_OF_REVIEW - LIMITED_TIMES_OFFER, data = df)
summary(new_lm)
```
```{r, include=FALSE, cache=TRUE}
tidy(new_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(new_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(new_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(new_lm)$coef, digits = 3)
```
```{r, cache=TRUE}
idx <- order(coef(summary(new_lm))[,4])
out <- coef(summary(new_lm))[idx,]
knitr::kable(out, digits = 3)
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(new_lm, which=c(1,2))
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After analyzing the new finalize model, we can see that there is a concern with our data. In the Residuals vs. Fitted Plot, we can see a clear pattern in this plot that looks like a curve suggests non-linearity which indicate that the relationship between predictor and the PRICE is not linear. Futhermore, in the Q-Q plot,there is curved pattern that is not lie on the straigt line. This indicates that there is non-normality in the residual that can impact the validity hypothesis tests and those outlier could also impact the model. Therefore, we could improve by managing the outlier that might impact the distribution of residual.
From the following table, we can see the effect on Median Value by the predictor variables.
```{r, cache=TRUE}
predchang = data_frame(
Variable = c('SIZE1', 'RATING','VALUE_PRICE','ONLINE_ONLY','EXCLUSIVE','LIMITED_EDITION'),
Direction = c('Increase','Increase','Increase','Decrease', 'Decrease','Decrease'))
knitr::kable(predchang)
```
CAT.PRICE Analysis {data-orientation=rows}
=======================================================================
Column {data-width=250}
-----------------------------------------------------------------------
### Predict CAT.PRICE
This nominal logistic model was created all the predictors to determine whether CAT.PRICE is above or below 0.5.The predictors EXCLUSIVE, LIMITED_EDITION, and VALUE_PRICE—all had p-values below the alpha level of 0.1. The model graph indicates that all variables have p-values less than 0.05.
The Training Error Rate is (38+58) / (2520+38+58+1049) = 0.026 = 2.6%
The Training Sensitivity = 1049/1049+58=0.947 = 94.7%
The Validation Error Rate is (28+31)/(1638+28+31+701) = 0.024 = 2.4%
The Validation Sensitivity = 701/701+31= 0.957 = 95.7%
Both training and validation error rate have a low error rate which indicates that the model will perform well with a few mistakes as well as the sensitivity for both training and validation, they have the highest sensitivity rate that indicates the great performance of the model.
Column {data-width=250}
-----------------------------------------------------------------------

Conclusion
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Summary
In conclusion, our analysis of Sephora website shows a negative correlation between price and consumer affection, suggesting higher-priced items attract less love, highlighting Sephora's price-sensitive customer base. And the regression model with 97% of adjusted R-squared shows that the effective of predictors in price determination. Nominal logistic regression also shows that whereas features can increase pricing, they do not always turn into more customer love.